Conversation

niruta25
Contributor

@niruta25 niruta25 commented Aug 9, 2025

Preserve object dtype for categories when constructing Categorical from pandas objects

This PR fixes an inconsistency in how pandas infers the dtype of categories when constructing a Categorical from different input types:

  • When constructing a Categorical from a pandas Series or Index with dtype="object", the categories' dtype is now preserved as object.
  • When constructing from a NumPy array with dtype="object" or a raw Python sequence, pandas continues to infer the most specific dtype for the categories (e.g., str if all elements are strings).

This change brings the behavior of Categorical in line with how Series and Index handle dtype preservation, making the API more consistent and predictable.

Example

import numpy as np
import pandas as pd

pd.options.future.infer_string = True

ser = pd.Series(["foo", "bar", "baz"], dtype="object")
idx = pd.Index(["foo", "bar", "baz"], dtype="object")
arr = np.array(["foo", "bar", "baz"], dtype="object")
pylist = ["foo", "bar", "baz"]

cat_from_ser = pd.Categorical(ser)
cat_from_idx = pd.Categorical(idx)
cat_from_arr = pd.Categorical(arr)
cat_from_list = pd.Categorical(pylist)

# Series/Index with object dtype: preserve object dtype
assert cat_from_ser.categories.dtype == "object"
assert cat_from_idx.categories.dtype == "object"

# NumPy array or list: infer string dtype
assert cat_from_arr.categories.dtype == "str"
assert cat_from_list.categories.dtype == "str"

Documentation and release notes have been updated.
Closes: #61778

@niruta25 changed the title from "Niruta issue61778" to "BUG: creating Categorical from pandas Index/Series with "object" dtype infers string" on Aug 9, 2025
@niruta25
Contributor Author

niruta25 commented Aug 9, 2025

@jbrockmendel Regarding this bug: always preserving object dtype for categories when constructing a Categorical from a pandas Series or Index with dtype="object" is a behavioral change that affects a wide range of pandas internals and user-facing APIs, so I am seeing many test failures.

I see two ways to resolve this without changing the overall behavior.

  1. Only preserve object dtype when not all elements are strings
  • If the input is a pandas Series/Index with dtype="object", only preserve object dtype for categories if not all elements are strings.
  • If all elements are strings, allow inference to str (the current behavior).
  2. Add a keyword argument to Categorical (e.g., preserve_object_dtype=False)
  • Add an explicit option to the Categorical constructor to preserve the object dtype for categories.
  • Default to the current behavior, but allow users to opt in to preservation.

Let me know your thoughts.
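
To make the two options concrete, here is a minimal sketch of the decision each one implies. It is plain Python, not pandas code: `inferred_label`, `option_one`, and `option_two` are hypothetical names used only for illustration, the returned strings merely label the outcome, and inference is deliberately simplified to "all strings gives str, anything else gives object".

```python
from typing import Sequence


def inferred_label(categories: Sequence[object]) -> str:
    """Simplified stand-in for inference: all strings -> "str", else "object"."""
    return "str" if all(isinstance(x, str) for x in categories) else "object"


def option_one(categories: Sequence[object], from_pandas_object: bool) -> str:
    """Option 1: keep object only when not every element is a string."""
    all_strings = all(isinstance(x, str) for x in categories)
    if from_pandas_object and not all_strings:
        return "object"
    return inferred_label(categories)


def option_two(
    categories: Sequence[object],
    from_pandas_object: bool,
    preserve_object_dtype: bool = False,  # hypothetical keyword from the proposal
) -> str:
    """Option 2: keep object only when the caller explicitly opts in."""
    if preserve_object_dtype and from_pandas_object:
        return "object"
    return inferred_label(categories)


# All-string input from an object-dtype Series/Index:
print(option_one(["foo", "bar"], from_pandas_object=True))
print(option_two(["foo", "bar"], from_pandas_object=True))
print(option_two(["foo", "bar"], from_pandas_object=True, preserve_object_dtype=True))
```

Under these assumptions, only the opt-in call in option 2 changes the outcome for all-string input; the other two calls keep the inferred label.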

    def test_groupby_extension_agg(self, as_index, data_for_grouping):
        super().test_groupby_extension_agg(as_index, data_for_grouping)

    def test_categorical_preserve_object_dtype_from_pandas(self):
Member

this will probably go in tests.arrays.categorical.test_constructors or something similar

    def test_categorical_preserve_object_dtype_from_pandas(self):
        import numpy as np

        import pandas as pd
Member

these imports go at the top of the file


        import pandas as pd

        pd.options.future.infer_string = True
Member

use tm.option_context for this

Contributor Author

done

codes, categories = factorize(values, sort=False)
if dtype.ordered:
    # raise, as we don't have a sortable data structure and so
    # the user should give us one by specifying categories
Member

why is this removed?

Contributor Author

I felt the comments were redundant because the TypeError message already explains the situation clearly. Separately, new logic was added to detect whether the input values are a pandas Series or Index with "object" dtype, and then force the categories to use object dtype.

Contributor Author

I do not have a strong preference, though, and I am happy to add the comments back.

@niruta25
Contributor Author

@jbrockmendel Thank you for your comments. I have addressed them, although I doubt that alone will fix all the failing tests. Please let me know your thoughts, and whether you prefer either of the two options I listed.

@niruta25
Contributor Author

I tried the first option, and all the test cases now pass. Let me know your thoughts.

# Check for pandas Series/Index with object dtype
preserve_object_dtpe = False
if isinstance(values, (ABCSeries, ABCIndex)):
    if getattr(values.dtype, "name", None) == "object":
Member

you can just check values.dtype == object
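
A small sketch of the suggested simplification: comparing a pandas dtype directly with the builtin `object` gives the same answer as the `getattr(..., "name", None)` check for these inputs. The sample values are made up for illustration.

```python
import pandas as pd

ser = pd.Series(["foo", 1], dtype="object")
idx = pd.Index(["foo", "bar"], dtype="object")
ints = pd.Series([1, 2, 3])

for values in (ser, idx, ints):
    # Check as written in the diff: compare the dtype's name.
    by_name = getattr(values.dtype, "name", None) == "object"
    # Check as suggested in review: compare the dtype itself.
    direct = values.dtype == object
    assert by_name == direct
```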

) from err

# we're inferring from values
# If we should prserve object dtype, force categories to object dtype
Member

typo prserve -> preserve

# If we should prserve object dtype, force categories to object dtype
if preserve_object_dtpe:
    # Only preserve object dtype if not all elements are strings
    if not all(isinstance(x, str) for x in categories):
Member

why is this check necessary?

Contributor Author

Always preserving object dtype for categories when constructing a Categorical from a pandas Series or Index with dtype="object" is a behavioral change that affects a wide range of pandas internals and user-facing APIs.
To avoid breaking other functionality, I preserve object dtype only when not all elements are strings.
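
For context, a sketch of the mixed-type case that this check targets. The sample data is invented; the example only shows what the categories look like for an object-dtype Series whose values are not all strings.

```python
import pandas as pd

mixed = pd.Series(["foo", 1, "foo"], dtype="object")
cat = pd.Categorical(mixed)

# Mixed str/int values cannot be represented by a string dtype,
# so the categories keep object dtype.
print(cat.categories.dtype)
```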

@rhshadrach
Member

@niruta25 - are you interested in continuing here? If not, I can pick this up.

@niruta25
Contributor Author

niruta25 commented Oct 5, 2025

@niruta25 - are you interested in continuing here? If not, I can pick this up.

Hey, I would like to continue working on this if that's OK. I should be able to resolve the comments by this weekend.

@niruta25
Contributor Author

niruta25 commented Oct 6, 2025

@rhshadrach @jbrockmendel I have addressed the comments. Please review.

Development

Successfully merging this pull request may close these issues.

BUG?: creating Categorical from pandas Index/Series with "object" dtype infers string

3 participants